From the Berkeley Earth Data page, this dataset is made up of temperature recordings from the Earth's surface.
The data ranges from November 1st, 1743 to December 1st, 2013. The dataset files used are:
GlobalLandTemperaturesByMajorCity and GlobalLandTemperaturesByState.
Continents data, taken from the Machin GitHub repository, is also integrated to aid in attaining the objectives.
The major-city data has columns for date, Country, average temperature, temperature uncertainty, City, Latitude and Longitude.
The state data has columns for date, Country, average temperature, temperature uncertainty and State.
The continents data contains columns for name, alpha-2, alpha-3, region, region code, sub-region and ISO code.
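The Latitude and Longitude columns are compass-style strings such as 5.63N and 3.23W; for any numeric or mapping work they can be converted to signed floats. A minimal sketch, assuming this simple format throughout (the helper name `parse_coord` is illustrative, not part of the original analysis):

```python
def parse_coord(value: str) -> float:
    """Convert a compass coordinate such as '5.63N' or '3.23W' to a signed float."""
    direction = value[-1].upper()
    magnitude = float(value[:-1])
    # South and West map to negative values by convention
    return -magnitude if direction in ('S', 'W') else magnitude

print(parse_coord('5.63N'))  # 5.63
print(parse_coord('3.23W'))  # -3.23
```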
The goals of the undertaking:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime, timedelta
from matplotlib import pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model is deprecated
import scipy
from itertools import product
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.statespace.sarimax import SARIMAX
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv('GlobalLandTemperaturesByMajorCity.csv')
forecast_df = df.copy()
# A look at the cities
city_data = df.drop_duplicates(['City'])
city_data.head()
| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1849-01-01 | 26.704 | 1.435 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 1977 | 1850-01-01 | 15.986 | 1.537 | Addis Abeba | Ethiopia | 8.84N | 38.11E |
| 3942 | 1796-01-01 | 19.649 | 2.286 | Ahmadabad | India | 23.31N | 72.52E |
| 6555 | 1791-05-01 | 20.836 | 1.993 | Aleppo | Syria | 36.17N | 37.79E |
| 9224 | 1791-05-01 | 20.772 | 1.848 | Alexandria | Egypt | 31.35N | 30.16E |
# Inspect the data
df.shape
(239177, 7)
df.columns
Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
'Country', 'Latitude', 'Longitude'],
dtype='object')
df.head()
| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1849-01-01 | 26.704 | 1.435 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 1 | 1849-02-01 | 27.434 | 1.362 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 2 | 1849-03-01 | 28.101 | 1.612 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 3 | 1849-04-01 | 26.140 | 1.387 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 4 | 1849-05-01 | 25.427 | 1.200 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239177 entries, 0 to 239176
Data columns (total 7 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   dt                             239177 non-null  object
 1   AverageTemperature             228175 non-null  float64
 2   AverageTemperatureUncertainty  228175 non-null  float64
 3   City                           239177 non-null  object
 4   Country                        239177 non-null  object
 5   Latitude                       239177 non-null  object
 6   Longitude                      239177 non-null  object
dtypes: float64(2), object(5)
memory usage: 12.8+ MB
df.rename(columns={'dt': 'Date', 'AverageTemperature': 'Avg_temp',
                   'AverageTemperatureUncertainty': 'AvgTempUncertainty'}, inplace=True)
df.head()
| | Date | Avg_temp | AvgTempUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1849-01-01 | 26.704 | 1.435 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 1 | 1849-02-01 | 27.434 | 1.362 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 2 | 1849-03-01 | 28.101 | 1.612 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 3 | 1849-04-01 | 26.140 | 1.387 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 4 | 1849-05-01 | 25.427 | 1.200 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
#Checking for any null values
df.isnull().values.any()
True
df = df.dropna(how='any', axis=0)
df.shape
(228175, 7)
# Statistical summary of the data
df.describe()
| | Avg_temp | AvgTempUncertainty |
|---|---|---|
| count | 228175.000000 | 228175.000000 |
| mean | 18.125969 | 0.969343 |
| std | 10.024800 | 0.979644 |
| min | -26.772000 | 0.040000 |
| 25% | 12.710000 | 0.340000 |
| 50% | 20.428000 | 0.592000 |
| 75% | 25.918000 | 1.320000 |
| max | 38.283000 | 14.037000 |
df['Date2'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date2'].dt.year
#Group by year
by_year = df.groupby(by =['Year','City','Country','Latitude','Longitude']).mean().reset_index()
world_map = pd.read_csv('continents2.csv') # load the continent data
world_map.head()
| | name | alpha-2 | alpha-3 | country-code | iso_3166-2 | region | sub-region | intermediate-region | region-code | sub-region-code | intermediate-region-code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 4 | ISO 3166-2:AF | Asia | Southern Asia | NaN | 142.0 | 34.0 | NaN |
| 1 | Åland Islands | AX | ALA | 248 | ISO 3166-2:AX | Europe | Northern Europe | NaN | 150.0 | 154.0 | NaN |
| 2 | Albania | AL | ALB | 8 | ISO 3166-2:AL | Europe | Southern Europe | NaN | 150.0 | 39.0 | NaN |
| 3 | Algeria | DZ | DZA | 12 | ISO 3166-2:DZ | Africa | Northern Africa | NaN | 2.0 | 15.0 | NaN |
| 4 | American Samoa | AS | ASM | 16 | ISO 3166-2:AS | Oceania | Polynesia | NaN | 9.0 | 61.0 | NaN |
world_map['Country'] = world_map['name']
world_map = world_map[['Country', 'region','alpha-2', 'alpha-3']]
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
data_new = pd.merge(left = by_year, right = world_map, on = 'Country', how = 'left')
# The year 1840 was specifically chosen to represent the industrial revolution
data_new = data_new[data_new['Year']>= 1840]
# Aggregate by region, country and city
region = data_new.dropna(axis = 0).groupby(by = ['region', 'Year']).mean().reset_index()
df_new = data_new.dropna(axis = 0).groupby(by = ['region', 'Country','Year']).mean().reset_index()
City = data_new.dropna(axis = 0).groupby(by = ['region', 'Country','City','Year', 'Latitude','Longitude']).mean().reset_index()
# Continent-level temperature plots
continent = make_subplots(rows=1, cols=2, insets=[{'cell':(1,1), 'l': 0.7, 'b': 0.3}])
continent.update_layout(title="Max and Average Temperatures by Continents", title_font_size = 20,
template = "ggplot2", hovermode = "closest")
continent.update_xaxes(showline=True, linewidth=1, linecolor='gray')
continent.update_yaxes(showline=True, linewidth=1, linecolor='gray')
region_colors = {'Europe': 'rgb(120,0,0)', 'Americas': 'rgb(200, 100,40)', 'Asia': 'rgb(145,205,56)',
                 'Africa': 'rgb(125,145,56)', 'Oceania': 'rgb(245,105,150)'}
for name, color in region_colors.items():
    subset = region[region['region'] == name]
    continent.add_trace(go.Scatter(x=subset['Year'], y=subset['Avg_temp'],
                                   name=name, marker_color=color), row=1, col=1)
leftside = np.round(region.groupby(by = 'region')['Avg_temp'].mean().tolist(), 1)
rightside = np.round(region.groupby(by = 'region')['Avg_temp'].max().tolist(), 1)
continent.add_trace(go.Bar(x = region['region'].unique(), y = region.groupby(by ='region')['Avg_temp'].mean().tolist(),
name = 'Average temperature', marker_color ='rgb(3, 121, 110)', text = leftside, textposition = 'auto'),
row = 1, col =2)
continent.add_trace(go.Bar(x = region['region'].unique(), y = region.groupby(by ='region')['Avg_temp'].max().tolist(),
name = 'Maximum temperature', marker_color ='rgb(255, 134, 10)', text = rightside, textposition = 'auto'),
row = 1, col =2)
# Next, the temperature differences among countries are calculated and visualised
meanvalue = df_new.groupby(['Country','region'])['Avg_temp'].mean().reset_index()
maxvalue = df_new.groupby(['Country','region'])['Avg_temp'].max().reset_index()
total = pd.merge(left = meanvalue, right = maxvalue, on = ['Country','region'])
total['diff'] = total['Avg_temp_y'] - total['Avg_temp_x']
rank = go.Figure()
rank.update_layout(title='Difference in the Temperature Among Countries', title_font_size = 20,
template = "ggplot2", autosize = False, height = 3000, width = 900)
rank.update_xaxes(showline=True, linewidth=1)
rank.update_yaxes(showline=True, linewidth=1)
sort_diff = total[['Country','region', 'diff']].sort_values(by = 'diff', ascending = True)
rank.add_trace(go.Bar(x = sort_diff['diff'], y = sort_diff['Country'], orientation = 'h',
marker = dict(color='rgb(113,168,131)', line = dict(color='rgb(156,114,250)',width=0.6))))
rank.show()
world_map = data_new.dropna(axis =0).groupby(by = ['region','Country','Year','alpha-3']).mean().reset_index()
world_map['Avg_temp'] = world_map['Avg_temp'] + 6  # shift up so marker sizes stay positive (some averages are negative)
wmap = px.scatter_geo(world_map, locations='alpha-3', color='region',
color_discrete_sequence = ('rgb(154, 65, 17)','rgb(152, 0, 73)', 'rgb(249, 160, 26)', 'rgb(128,255,25)','rgb(255,0, 0)'),
hover_name='Country', size="Avg_temp", size_max = 20, opacity = 0.5,
animation_frame="Year", projection="natural earth", title='World Map Temperature')
wmap.show()
#A look at Italy first
italy_country = df[df['Country']=='Italy']
italy_country = italy_country.reset_index()
italy_country = italy_country.drop(columns=['index'])
italy_country.Date = pd.to_datetime(italy_country.Date)
# Derive calendar features directly from the datetime column
italy_country['Year'] = italy_country.Date.dt.year
italy_country['Month'] = italy_country.Date.dt.month
italy_country['Day'] = italy_country.Date.dt.day
italy_country['Weekday'] = italy_country.Date.dt.weekday
change_year_index = []
change_year = []
year_list = italy_country['Year'].tolist()
for y in range(0, len(year_list) - 1):
    if year_list[y] != year_list[y + 1]:
        change_year.append(year_list[y + 1])
        change_year_index.append(y + 1)
italy_country.loc[change_year_index].head()
| | Date | Avg_temp | AvgTempUncertainty | City | Country | Latitude | Longitude | Date2 | Year | Month | Day | Weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1744-04-01 | 12.154 | 2.006 | Rome | Italy | 42.59N | 13.09E | 1744-04-01 | 1744 | 4 | 1 | 2 |
| 9 | 1745-01-01 | 1.583 | 2.056 | Rome | Italy | 42.59N | 13.09E | 1745-01-01 | 1745 | 1 | 1 | 4 |
| 13 | 1750-01-01 | 3.095 | 1.835 | Rome | Italy | 42.59N | 13.09E | 1750-01-01 | 1750 | 1 | 1 | 3 |
| 24 | 1751-01-01 | 2.860 | 2.056 | Rome | Italy | 42.59N | 13.09E | 1751-01-01 | 1751 | 1 | 1 | 4 |
| 32 | 1752-01-01 | 1.760 | 2.372 | Rome | Italy | 42.59N | 13.09E | 1752-01-01 | 1752 | 1 | 1 | 5 |
# Average Temperature in Italy from 1950 to 2013
a1950=italy_country[italy_country.Year>=1950]
plt.figure(figsize=(15,4))
plt.subplot(121)
sns.lineplot(x = a1950["Year"], y = a1950["Avg_temp"])
plt.title("Average Temperature and Year")
plt.xlabel("Year")
plt.ylabel("Average Temperature")
plt.xticks(rotation = 45)
plt.show()
Rome_data = df[df['City']=='Rome']
Rome_data = Rome_data.reset_index()
Rome_data = Rome_data.drop(columns=['index'])
Rome_data.Date = pd.to_datetime(Rome_data.Date)
# Derive calendar features directly from the datetime column
Rome_data['Year'] = Rome_data.Date.dt.year
Rome_data['Month'] = Rome_data.Date.dt.month
Rome_data['Day'] = Rome_data.Date.dt.day
Rome_data['Weekday'] = Rome_data.Date.dt.weekday
change_year_index = []
change_year = []
year_list = Rome_data['Year'].tolist()
for y in range(0, len(year_list) - 1):
    if year_list[y] != year_list[y + 1]:
        change_year.append(year_list[y + 1])
        change_year_index.append(y + 1)
Rome_data.loc[change_year_index].head()
| | Date | Avg_temp | AvgTempUncertainty | City | Country | Latitude | Longitude | Date2 | Year | Month | Day | Weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1744-04-01 | 12.154 | 2.006 | Rome | Italy | 42.59N | 13.09E | 1744-04-01 | 1744 | 4 | 1 | 2 |
| 9 | 1745-01-01 | 1.583 | 2.056 | Rome | Italy | 42.59N | 13.09E | 1745-01-01 | 1745 | 1 | 1 | 4 |
| 13 | 1750-01-01 | 3.095 | 1.835 | Rome | Italy | 42.59N | 13.09E | 1750-01-01 | 1750 | 1 | 1 | 3 |
| 24 | 1751-01-01 | 2.860 | 2.056 | Rome | Italy | 42.59N | 13.09E | 1751-01-01 | 1751 | 1 | 1 | 4 |
| 32 | 1752-01-01 | 1.760 | 2.372 | Rome | Italy | 42.59N | 13.09E | 1752-01-01 | 1752 | 1 | 1 | 5 |
Rome1950 = Rome_data[Rome_data.Year>=1950]
plt.figure(figsize=(15,4))
plt.subplot(121)
sns.lineplot(x = Rome1950["Year"], y = Rome1950["Avg_temp"])
plt.title("Average Temperature and Year")
plt.xlabel("Year")
plt.ylabel("Average Temperature")
plt.xticks(rotation = 45)
plt.show()
last_year_data = Rome_data[Rome_data.Year>=2010].reset_index().drop(columns=['index'])
P = np.linspace(0,len(last_year_data)-1,5).astype(int)
def get_timeseries(start_year, end_year):
    last_year_data = Rome_data[(Rome_data.Year >= start_year) & (Rome_data.Year <= end_year)].reset_index(drop=True)
    return last_year_data

def plot_timeseries(start_year, end_year):
    last_year_data = get_timeseries(start_year, end_year)
    P = np.linspace(0, len(last_year_data) - 1, 5).astype(int)
    plt.plot(last_year_data.Avg_temp, marker='.', color='blue')
    plt.xticks(np.arange(0, len(last_year_data), 1)[P], last_year_data.Date.loc[P], rotation=60)
    plt.xlabel('Date (Y/M/D)')
    plt.ylabel('Average Temperature')

def plot_from_data(data, time, c='firebrick', with_ticks=True, label=None):
    data = np.array(data.tolist())
    time = np.array(time.tolist())
    P = np.linspace(0, len(data) - 1, 5).astype(int)
    if label is None:
        plt.plot(data, marker='.', color=c)
    else:
        plt.plot(data, marker='.', color=c, label=label)
    if with_ticks:
        plt.xticks(np.arange(0, len(data), 1)[P], time[P], rotation=60)
        plt.xlabel('Date (Y/M/D)')
        plt.ylabel('Average Temperature')
plt.figure(figsize=(20,20))
plt.suptitle('Plotting 4 decades',fontsize=40,color='black')
plt.subplot(2,2,1)
plt.title('Starting year: 1800, Ending Year: 1810',fontsize=15)
plot_timeseries(1800,1810)
plt.subplot(2,2,2)
plt.title('Starting year: 1900, Ending Year: 1910',fontsize=15)
plot_timeseries(1900,1910)
plt.subplot(2,2,3)
plt.title('Starting year: 1950, Ending Year: 1960',fontsize=15)
plot_timeseries(1950,1960)
plt.subplot(2,2,4)
plt.title('Starting year: 2000, Ending Year: 2010',fontsize=15)
plot_timeseries(2000,2010)
plt.tight_layout()
#Check stationarity
result = adfuller(Rome_data.Avg_temp)
print('ADF Statistic on the entire dataset: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('Critical Values:')
for key, value in result[4].items():
    print('\t{}: {}'.format(key, value))
ADF Statistic on the entire dataset: -5.540182217095076
p-value: 1.7086667139689457e-06
Critical Values:
	1%: -3.4324369453939885
	5%: -2.862462083609354
	10%: -2.567260846851735
Facebook Prophet was used to produce the forecast.
forecast_df.head()
| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1849-01-01 | 26.704 | 1.435 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 1 | 1849-02-01 | 27.434 | 1.362 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 2 | 1849-03-01 | 28.101 | 1.612 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 3 | 1849-04-01 | 26.140 | 1.387 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
| 4 | 1849-05-01 | 25.427 | 1.200 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W |
forecast_df = forecast_df.reset_index()
forecast_df = forecast_df.drop(columns=['index'])
forecast_df.dt = pd.to_datetime(forecast_df.dt)
# Derive calendar features directly from the datetime column
forecast_df['Year'] = forecast_df['dt'].dt.year
forecast_df['Month'] = forecast_df['dt'].dt.month
forecast_df['Day'] = forecast_df['dt'].dt.day
forecast_df['Weekday'] = forecast_df['dt'].dt.weekday
change_year_index = []
change_year = []
year_list = forecast_df['Year'].tolist()  # track year boundaries in forecast_df itself, not italy_country
for y in range(0, len(year_list) - 1):
    if year_list[y] != year_list[y + 1]:
        change_year.append(year_list[y + 1])
        change_year_index.append(y + 1)
forecast_df.loc[change_year_index].head()
| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude | Year | Month | Day | Weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1849-02-01 | 27.434 | 1.362 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W | 1849 | 2 | 1 | 3 |
| 9 | 1849-10-01 | 25.263 | 1.175 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W | 1849 | 10 | 1 | 0 |
| 13 | 1850-02-01 | 27.890 | 1.430 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W | 1850 | 2 | 1 | 4 |
| 24 | 1851-01-01 | 26.789 | 1.249 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W | 1851 | 1 | 1 | 2 |
| 32 | 1851-09-01 | 24.083 | 1.121 | Abidjan | Côte D'Ivoire | 5.63N | 3.23W | 1851 | 9 | 1 | 0 |
forecast_df=forecast_df[forecast_df.Year>=1950]
forecast_df = forecast_df.dropna(how='all')
forecast_df = forecast_df.reset_index(drop=True)
print(forecast_df.dtypes)
dt                               datetime64[ns]
AverageTemperature                      float64
AverageTemperatureUncertainty           float64
City                                     object
Country                                  object
Latitude                                 object
Longitude                                object
Year                                      int64
Month                                     int64
Day                                       int64
Weekday                                   int64
dtype: object
df_prophet = forecast_df[['AverageTemperature', 'dt']]
from prophet import Prophet
jh = df_prophet.rename(columns={'dt': 'ds', 'AverageTemperature': 'y'})
jh_model = Prophet(interval_width=0.95)
jh_model.fit(jh)
23:34:52 - cmdstanpy - INFO - Chain [1] start processing
23:35:04 - cmdstanpy - INFO - Chain [1] done processing
<prophet.forecaster.Prophet at 0x246431c9a60>
jh_forecast = jh_model.make_future_dataframe(periods=36, freq='MS')
jh_forecast = jh_model.predict(jh_forecast)
plt.figure(figsize=(18, 6))
jh_model.plot(jh_forecast, xlabel = 'Date', ylabel = 'Temperature')
plt.title('Temperature of the World Rates')
Text(0.5, 1.0, 'Temperature of the World Rates')
<Figure size 1296x432 with 0 Axes>
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df.index
DatetimeIndex(['1849-01-01', '1849-02-01', '1849-03-01', '1849-04-01',
'1849-05-01', '1849-06-01', '1849-07-01', '1849-08-01',
'1849-09-01', '1849-10-01',
...
'2012-11-01', '2012-12-01', '2013-01-01', '2013-02-01',
'2013-03-01', '2013-04-01', '2013-05-01', '2013-06-01',
'2013-07-01', '2013-08-01'],
dtype='datetime64[ns]', name='Date', length=228175, freq=None)
df['Year'] = df.index.year
latest_df = df.loc['1980':'2013']
# A glance at summary statistics from the data
df.describe()
latest_df.describe()
| | Avg_temp | AvgTempUncertainty | Year |
|---|---|---|---|
| count | 40407.000000 | 40407.000000 | 40407.000000 |
| mean | 19.463014 | 0.363776 | 1996.339520 |
| std | 9.565626 | 0.208110 | 9.721872 |
| min | -23.495000 | 0.056000 | 1980.000000 |
| 25% | 14.556500 | 0.236000 | 1988.000000 |
| 50% | 21.841000 | 0.315000 | 1996.000000 |
| 75% | 26.769000 | 0.430000 | 2005.000000 |
| max | 38.283000 | 3.963000 | 2013.000000 |
Each of the estimates covered above describes the location or variability of the data with a single number. It is also useful to explore how the data is distributed overall.
Boxplots are based on percentiles and give a quick way to visualize the distribution of data. The median is shown by the horizontal line in the box. The dashed lines, referred to as whiskers, extend from the top and bottom of the box to indicate the range of the bulk of the data.
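The whisker rule can be made concrete: in the standard convention the whiskers reach to the last data point within 1.5×IQR of the quartiles. A small numeric illustration on made-up values (not drawn from the dataset):

```python
import numpy as np

sample = np.array([12.7, 14.6, 18.1, 20.4, 21.8, 25.9, 26.8, 38.3])
q1, q3 = np.percentile(sample, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower_whisker = q1 - 1.5 * iqr             # points outside these bounds are plotted as outliers
upper_whisker = q3 + 1.5 * iqr
print(round(q1, 3), round(q3, 3), round(iqr, 3))  # 17.225 26.125 8.9
```

Here 38.3 lies inside the upper bound (about 39.475), so this sample would show no outlier fliers.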
fig, ax = plt.subplots(figsize=(10, 10))
ax.boxplot(latest_df["Avg_temp"])
{'whiskers': [<matplotlib.lines.Line2D at 0x24647769fa0>,
<matplotlib.lines.Line2D at 0x2464776d370>],
'caps': [<matplotlib.lines.Line2D at 0x2464776d6a0>,
<matplotlib.lines.Line2D at 0x2464776da30>],
'boxes': [<matplotlib.lines.Line2D at 0x24647769c10>],
'medians': [<matplotlib.lines.Line2D at 0x2464776ddc0>],
'fliers': [<matplotlib.lines.Line2D at 0x2464776f190>],
'means': []}
# Density plot: a smoothed histogram
hist_withf = sns.histplot(latest_df["Avg_temp"], bins=10, kde=True)  # sns.distplot is deprecated
# Look at the dataset on temperature by state
df1 = pd.read_csv('GlobalLandTemperaturesByState.csv')
df1.head()
| | dt | AverageTemperature | AverageTemperatureUncertainty | State | Country |
|---|---|---|---|---|---|
| 0 | 1855-05-01 | 25.544 | 1.171 | Acre | Brazil |
| 1 | 1855-06-01 | 24.228 | 1.103 | Acre | Brazil |
| 2 | 1855-07-01 | 24.371 | 1.044 | Acre | Brazil |
| 3 | 1855-08-01 | 25.427 | 1.073 | Acre | Brazil |
| 4 | 1855-09-01 | 25.675 | 1.014 | Acre | Brazil |
df1.dtypes
dt                                object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
State                             object
Country                           object
dtype: object
df1.shape
(645675, 5)
df1.isnull().sum()
dt                                   0
AverageTemperature               25648
AverageTemperatureUncertainty    25648
State                                0
Country                              0
dtype: int64
df1 = df1.dropna(how='any', axis=0)
df1.shape
(620027, 5)
df1.rename(columns={'dt':'Date','AverageTemperature':'Avg_temp','AverageTemperatureUncertainty':'Internal_temp'}, inplace=True)
df1.head()
| | Date | Avg_temp | Internal_temp | State | Country |
|---|---|---|---|---|---|
| 0 | 1855-05-01 | 25.544 | 1.171 | Acre | Brazil |
| 1 | 1855-06-01 | 24.228 | 1.103 | Acre | Brazil |
| 2 | 1855-07-01 | 24.371 | 1.044 | Acre | Brazil |
| 3 | 1855-08-01 | 25.427 | 1.073 | Acre | Brazil |
| 4 | 1855-09-01 | 25.675 | 1.014 | Acre | Brazil |
fig, ax = plt.subplots(figsize=(10, 10))
ax.boxplot(df1["Avg_temp"])
{'whiskers': [<matplotlib.lines.Line2D at 0x246477a3ca0>,
<matplotlib.lines.Line2D at 0x246477ad070>],
'caps': [<matplotlib.lines.Line2D at 0x246477ad400>,
<matplotlib.lines.Line2D at 0x246477ad790>],
'boxes': [<matplotlib.lines.Line2D at 0x246477a3910>],
'medians': [<matplotlib.lines.Line2D at 0x246477adb20>],
'fliers': [<matplotlib.lines.Line2D at 0x246477adeb0>],
'means': []}
# Density plot: a smoothed histogram
hist_withf = sns.histplot(df1["Avg_temp"], bins=10, kde=True)  # sns.distplot is deprecated
df1['Date'] = pd.to_datetime(df1['Date'])
df1.set_index('Date', inplace=True)
df1.index
DatetimeIndex(['1855-05-01', '1855-06-01', '1855-07-01', '1855-08-01',
'1855-09-01', '1855-10-01', '1855-11-01', '1855-12-01',
'1856-01-01', '1856-02-01',
...
'2012-11-01', '2012-12-01', '2013-01-01', '2013-02-01',
'2013-03-01', '2013-04-01', '2013-05-01', '2013-06-01',
'2013-07-01', '2013-08-01'],
dtype='datetime64[ns]', name='Date', length=620027, freq=None)
df1.describe()
| | Avg_temp | Internal_temp |
|---|---|---|
| count | 620027.000000 | 620027.000000 |
| mean | 8.993111 | 1.287647 |
| std | 13.772150 | 1.360392 |
| min | -45.389000 | 0.036000 |
| 25% | -0.693000 | 0.316000 |
| 50% | 11.199000 | 0.656000 |
| 75% | 19.899000 | 1.850000 |
| max | 36.339000 | 12.646000 |
df1['Year'] = df1.index.year
df1.head()
| Date | Avg_temp | Internal_temp | State | Country | Year |
|---|---|---|---|---|---|
| 1855-05-01 | 25.544 | 1.171 | Acre | Brazil | 1855 |
| 1855-06-01 | 24.228 | 1.103 | Acre | Brazil | 1855 |
| 1855-07-01 | 24.371 | 1.044 | Acre | Brazil | 1855 |
| 1855-08-01 | 25.427 | 1.073 | Acre | Brazil | 1855 |
| 1855-09-01 | 25.675 | 1.014 | Acre | Brazil | 1855 |
df1.describe()
| | Avg_temp | Internal_temp | Year |
|---|---|---|---|
| count | 620027.000000 | 620027.000000 | 620027.000000 |
| mean | 8.993111 | 1.287647 | 1902.331598 |
| std | 13.772150 | 1.360392 | 67.831393 |
| min | -45.389000 | 0.036000 | 1743.000000 |
| 25% | -0.693000 | 0.316000 | 1851.000000 |
| 50% | 11.199000 | 0.656000 | 1906.000000 |
| 75% | 19.899000 | 1.850000 | 1960.000000 |
| max | 36.339000 | 12.646000 | 2013.000000 |
latest_df1 = df1.loc['1980':'2013']
latest_df1.head()
| Date | Avg_temp | Internal_temp | State | Country | Year |
|---|---|---|---|---|---|
| 1980-01-01 | 26.652 | 0.190 | Acre | Brazil | 1980 |
| 1980-02-01 | 26.495 | 0.495 | Acre | Brazil | 1980 |
| 1980-03-01 | 26.270 | 0.236 | Acre | Brazil | 1980 |
| 1980-04-01 | 26.430 | 0.201 | Acre | Brazil | 1980 |
| 1980-05-01 | 25.802 | 0.882 | Acre | Brazil | 1980 |
latest_df1[['Country','Avg_temp']].groupby(['Country']).mean().sort_values('Avg_temp')
| Country | Avg_temp |
|---|---|
| Canada | -0.681256 |
| Russia | 2.432833 |
| United States | 11.516373 |
| China | 12.150210 |
| Australia | 18.447621 |
| India | 22.871669 |
| Brazil | 24.537580 |
resample_df = latest_df1[['Avg_temp']].resample('A').mean()
resample_df.head()
| Date | Avg_temp |
|---|---|
| 1980-12-31 | 9.689856 |
| 1981-12-31 | 10.366578 |
| 1982-12-31 | 9.837958 |
| 1983-12-31 | 10.259516 |
| 1984-12-31 | 9.724950 |
resample_df.plot(title='Changes in Temperature from 1980-2013',figsize=(8,5))
plt.ylabel('Temperature',fontsize=12)
plt.xlabel('Year',fontsize=12)
plt.legend()
<matplotlib.legend.Legend at 0x24647ce46a0>
### Stationarity
from statsmodels.tsa.stattools import adfuller
print('The Dickey Fuller Test Results: ')
test_data = adfuller(resample_df.iloc[:,0].values, autolag='AIC')
output_data = pd.Series(test_data[0:4], index=['Test Statistic', 'P-value','Lags Used','Number of Observations Used'])
for key, value in test_data[4].items():
    output_data['Critical Value (%s)' % key] = value
print(output_data)
The Dickey Fuller Test Results:
Test Statistic                 -0.728209
P-value                         0.839307
Lags Used                       2.000000
Number of Observations Used    31.000000
Critical Value (1%)            -3.661429
Critical Value (5%)            -2.960525
Critical Value (10%)           -2.619319
dtype: float64
Here it can be noted that the test statistic is greater than the critical values, so we fail to reject the null hypothesis. Hence, the time series is not stationary.
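The decision rule applied here can be captured in a tiny helper (the name `is_stationary` is illustrative): with `adfuller`, reject the unit-root null when the p-value is below the chosen significance level. The tuple below reuses the numbers from the test output above.

```python
def is_stationary(adf_result, alpha=0.05):
    """adf_result is the tuple returned by statsmodels' adfuller:
    (test statistic, p-value, lags used, nobs, critical values, ...)."""
    return adf_result[1] < alpha

# Values from the annual-series test above: p ≈ 0.839, so we fail to reject
annual_result = (-0.728209, 0.839307, 2, 31,
                 {'1%': -3.661429, '5%': -2.960525, '10%': -2.619319})
print(is_stationary(annual_result))  # False
```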
### Decomposition plot
from statsmodels.tsa.seasonal import seasonal_decompose
decomp = seasonal_decompose(resample_df, period=3)  # the 'freq' argument was renamed to 'period' in statsmodels
trend = decomp.trend
seasonal = decomp.seasonal
residual = decomp.resid
plt.figure(figsize=(6, 10))
plt.subplot(411)
plt.plot(resample_df)
plt.xlabel('Original')
plt.subplot(412)
plt.plot(trend)
plt.xlabel('Trend')
plt.subplot(413)
plt.plot(seasonal)
plt.xlabel('Seasonal')
plt.subplot(414)
plt.plot(residual)
plt.xlabel('Residual')
plt.tight_layout()
# Transform the data using a moving average and exponential smoothing.
rol_mean = resample_df.rolling(window=3, center=True).mean()
# Exponential weighted mean
ewm = resample_df.ewm(span=3).mean()
# Rolling standard deviation
rol_std = resample_df.rolling(window=3, center=True).std()
# Creation of subplots next to each other
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
# Temperature graph of rolling mean and exponential weighted mean
ax1.plot(resample_df, label='Original')
ax1.plot(rol_mean, label='Rolling Mean')
ax1.plot(ewm, label='Exponential Weighted Mean')
ax1.set_title('Temperature Changes 1980-2013', fontsize=14)
ax1.set_ylabel('Temperature', fontsize=12)
ax1.set_xlabel('Year', fontsize=12)
ax1.legend()
# Temperature graph with rolling STD
ax2.plot(rol_std,label='Rolling STD' )
ax2.set_title('Temperature Changes 1980-2013', fontsize=14)
ax2.set_ylabel('Temperature', fontsize=12)
ax2.set_xlabel('Year', fontsize=12)
ax2.legend()
plt.tight_layout()
plt.show()
# Now reapply the Dickey Fuller test to check the hypothesis
rol_mean.dropna(inplace=True)
ewm.dropna(inplace=True)
print('Dickey Fuller Test for the rolling mean')
df_test = adfuller(rol_mean.iloc[:,0].values, autolag='AIC')
df_output = pd.Series(df_test[0:4], index=['Test Statistic', 'P-value','Lags Used','Number of Observations Used'])
for key, value in df_test[4].items():
    df_output['Critical Value (%s)' % key] = value
print(df_output)
print('-----------------------')
print('Dickey Fuller Test for the Exponential weighted mean')
df_test = adfuller(ewm.iloc[:,0].values, autolag='AIC')
df_output = pd.Series(df_test[0:4], index=['Test Statistic', 'P-value','Lags Used','Number of Observations Used'])
for key, value in df_test[4].items():
    df_output['Critical Value (%s)' % key] = value
print(df_output)
# The test statistic is still greater than the critical values, so we again fail to reject the null hypothesis
Dickey Fuller Test for the rolling mean
Test Statistic                  0.275101
P-value                         0.976173
Lags Used                       8.000000
Number of Observations Used    23.000000
Critical Value (1%)            -3.752928
Critical Value (5%)            -2.998500
Critical Value (10%)           -2.638967
dtype: float64
-----------------------
Dickey Fuller Test for the Exponential weighted mean
Test Statistic                 -0.338693
P-value                         0.919843
Lags Used                       2.000000
Number of Observations Used    31.000000
Critical Value (1%)            -3.661429
Critical Value (5%)            -2.960525
Critical Value (10%)           -2.619319
dtype: float64
# Use differencing to remove the rolling mean or exponential weighted mean from the original time series.
diff_rol_mean = resample_df - rol_mean
diff_rol_mean.dropna(inplace=True)
diff_rol_mean.head()
| Date | Avg_temp |
|---|---|
| 1981-12-31 | 0.401781 |
| 1982-12-31 | -0.316726 |
| 1983-12-31 | 0.318708 |
| 1984-12-31 | -0.091684 |
| 1985-12-31 | -0.236199 |
diff_ewm = resample_df - ewm
diff_ewm.dropna(inplace=True)
diff_ewm.head()
| Date | Avg_temp |
|---|---|
| 1980-12-31 | 0.000000 |
| 1981-12-31 | 0.225574 |
| 1982-12-31 | -0.129877 |
| 1983-12-31 | 0.136118 |
| 1984-12-31 | -0.192797 |
# Next, plot the differenced series
df_rol_mean_diff = diff_rol_mean.rolling(window=3, center=True).mean()
# Exponential weighted mean
df_ewm_mean_diff = diff_ewm.ewm(span=3).mean()
# Creation of subplots next to each other
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,5))
# Temperature graph of rolling mean
ax1.plot(diff_rol_mean, label='Original')
ax1.plot(df_rol_mean_diff, label='Rolling Mean')
ax1.set_title('Temperature Changes 1980-2013', fontsize=14)
ax1.set_ylabel('Temperature', fontsize=12)
ax1.set_xlabel('Year', fontsize=12)
ax1.legend()
# Temperature graph with exponential weighted mean
ax2.plot(diff_ewm,label='Original' )
ax2.plot(df_ewm_mean_diff,label='Exponential Weighted Mean' )
ax2.set_title('Temperature Changes 1980-2013', fontsize=14)
ax2.set_ylabel('Temperature', fontsize=12)
ax2.set_xlabel('Year', fontsize=12)
ax2.legend()
plt.tight_layout()
# Apply Dickey Fuller Test to check hypothesis
print('The Dickey Fuller Test for the Original and Rolling Mean: ')
test_data = adfuller(diff_rol_mean.iloc[:,0].values, autolag='AIC')
output_data = pd.Series(test_data[0:4], index=['Test Statistic', 'P-value','Lags Used','Number of Observations Used'])
for key, value in test_data[4].items():
    output_data['Critical Value (%s)' % key] = value
print(output_data)
print('The Dickey Fuller Test for the Original and Exponential Weighted Mean: ')
test_data = adfuller(diff_ewm.iloc[:,0].values, autolag='AIC')
output_data = pd.Series(test_data[0:4], index=['Test Statistic', 'P-value','Lags Used','Number of Observations Used'])
for key, value in test_data[4].items():
    output_data['Critical Value (%s)' % key] = value
print(output_data)
The Dickey Fuller Test for the Original and Rolling Mean:
Test Statistic                -7.007710e+00
P-value                        7.051586e-10
Lags Used                      1.000000e+00
Number of Observations Used    3.000000e+01
Critical Value (1%)           -3.669920e+00
Critical Value (5%)           -2.964071e+00
Critical Value (10%)          -2.621171e+00
dtype: float64
The Dickey Fuller Test for the Original and Exponential Weighted Mean:
Test Statistic                 -4.297446
P-value                         0.000449
Lags Used                       1.000000
Number of Observations Used    32.000000
Critical Value (1%)            -3.653520
Critical Value (5%)            -2.957219
Critical Value (10%)           -2.617588
dtype: float64
Here it can be noted that the test statistics are less than the critical values, so we can reject the null hypothesis and be confident the data is stationary.
# Autocorrelation and partial autocorrelation
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib import pyplot
#Plotting the graphs
pyplot.figure(figsize=(10,5))
pyplot.subplot(211)
plot_acf(resample_df, ax=pyplot.gca())
pyplot.subplot(212)
plot_pacf(resample_df, ax=pyplot.gca())
pyplot.show()